Abstract Community Discovery in complex networks is the problem of detecting,
for each node of the network, its membership to one or more groups of
nodes, the communities, that are densely connected, or highly interactive, or,
more generally, similar according to a similarity function. So far, the problem
has been widely studied in monodimensional networks, i.e., networks where
only one connection between two entities can exist. However, real networks are
often multidimensional, i.e., multiple connections between any two nodes can
exist, either reflecting different kinds of relationships, or representing different
values of the same type of tie. In this context, the problem of Community
Discovery has to be redefined, taking into account the multidimensional structure
of the graph. We define a new concept of community that groups together
nodes sharing memberships to the same monodimensional communities in the 
different single dimensions. As we show, such communities are meaningful and 
able to group highly correlated nodes, even if they might not be connected in 
any of the monodimensional networks. We devise ABACUS (Apriori-BAsed 
Community discoverer in multidimensional networks), an algorithm that is 
able to extract multidimensional communities based on the apriori itemset 
miner applied to monodimensional community memberships. Experiments on 
two different real multidimensional networks confirm the meaningfulness of 
the introduced concepts, and open the way for a new class of algorithms for 
community discovery that do not rely on the dense connections among nodes. 
Fig. 1 Example of multidimensional networks

Keywords Community discovery • Multidimensional Networks • Social 
Network Analysis 



1 Introduction 

Inspired by real-world scenarios such as social networks, technology networks,
the Web, biological networks, and so on, in recent years wide, multidisciplinary,
and extensive research has been devoted to the extraction of non-trivial
knowledge from networks. Predicting future links among the nodes or
actors of a network ([10]), detecting and studying the diffusion of information
among them ([19,25]), mining frequent patterns of nodes' behaviors ([SJ
[40][16]), are only a few examples of tasks in the field of Complex Network
Analysis, which involves, among others, physicists, mathematicians, computer
scientists, sociologists, economists and biologists. The data at the basis of this
field of research is huge, heterogeneous, and semantically rich, and this makes
it possible to identify many properties and behaviors of the actors involved in a network.
One crucial task at the basis of Complex Network Analysis is Community
Discovery, i.e., the discovery of groups of nodes that are densely connected, or highly
related. Many techniques have been proposed to identify communities in networks
([23,18]), allowing the detection of hierarchical connections, influential
nodes in communities, or just groups of nodes that share some properties or
behaviors. In order to do so, the connections among the nodes of a network
have so far been placed at the center of the investigation, since they play a key role in
the study of the network structure, evolution, and behavior.

Nowadays, most of the work done in the literature is limited to a very 
simplified perspective of such relations, focusing only on whether two nodes 
are connected or not, and possibly assigning a strength to this connection. 
In the real world, however, this is not always enough to model all the avail- 
able information about the interactions between actors, including their multi- 
ple preferences, their multifaceted behaviors, and their complex interactions. 
While multiple types of connections among actors could still be represented
in a monodimensional network, by collapsing all connections to one type and
potentially affecting a measure of tie strength, a more sophisticated analysis of



ABACUS: Apriori-BAsed Community discovery in multidimensional networks 







Fig. 2 An example of a multidimensional community found in DBLP by our algorithm ABACUS,
which other methods are not able to detect. Nodes in the community: Amit Agarwal,
Qikai Chen, Swaroop Ghosh, Patrick Ndai, Kaushik Roy.

the network structure, which could maintain information on the semantic dif- 
ferences in how actors are connected, would help all the techniques to provide 
more meaningful communities. 

To this aim, in this paper we deal with multidimensional networks, i.e.,
networks in which multiple connections may exist between a pair of nodes, re- 
flecting various interactions (i.e., dimensions) between them. Multidimension- 
ality in real networks may be expressed by either different types of connections 
(two persons may be connected because they are friends, colleagues, they play 
together in a team, and so on), or different quantitative values of one specific 
relationship (co-authorship between two authors may occur in several different 
years, for example). 

This distinction is reported in Figure 1, where on the left we have different
types of links, while on the right we have different values (conferences) for one
relationship (for example, co-authorship). Note that, while using the years (or
any temporal snapshot) as different dimensions is a typical way of achieving
the above, time is not the only kind of relationship whose different values
can be used as dimensions: weight, height, ranking of any kind, and any other
measurable relationship can be used as well. We can also distinguish between explicit and implicit
dimensions, the former being relationships explicitly set by the nodes (friendship,
for example), the latter being relationships inferred by the analyst,
who may link two nodes according to their similarity or other principles (two
users may be passively linked if they wrote a post on the same topic).

In this scenario, we deal with the problem of Multidimensional Community 
Discovery, i.e. the problem of detecting communities of actors in a multidi- 
mensional network. We define a new concept of multidimensional community 
that groups nodes sharing their membership to different communities in sin- 
gle dimensions. This concept gives us the possibility to leverage traditional 
monodimensional community discovery algorithms. It then allows us to define
the lattice of multidimensional communities as a function of the subset of dimensions
for which the monodimensional community memberships of nodes
are shared. Each multidimensional community can then be represented by the
associated subset of dimensions, providing a semantic meaning to the community.
Note that while the problem of finding cross-dimensional or cross-network
structures is not new [12,6,24], our definition of multidimensional community
differs from the previous ones. In fact, using this definition, a multidimen- 
sional community could be unconnected, i.e. composed of nodes which are 
not directly connected in any of the dimensions. This represents a complex 
phenomenon that can be seen in the real world: not all the people in a so- 
cial community are necessarily connected directly, and, if they share their 
memberships in more than one dimension, they can be seen as a (potentially 
unconnected) group of highly related (both positively and negatively) people. 

We devise ABACUS (Apriori-BAsed Community discoverer in multidimensional
networks), an algorithm that extracts multidimensional communities
such as the one in Figure 2, working in four steps:

1. Each dimension is treated separately and monodimensional communities 
are extracted 

2. Each node is labeled with a list of pairs (dimension, community the node 
belongs to in that dimension) 

3. Each pair is treated as an item and frequent itemset mining is
applied

4. Frequent closed itemsets represent multidimensional communities described 
by the itemsets 

ABACUS is based on existing monodimensional algorithms for community dis- 
covery (used as a parameter), and on the execution of the apriori algorithm to 
extract frequent itemsets, that, in our scenario, represent the multidimensional 
description of the communities. 

Our main contributions can then be summarized as follows: we introduce the
new concept of multidimensional communities, and the ABACUS algorithm to
extract them (Section 3); we show the applicability of ABACUS to real world
multidimensional networks (Section 4), together with a comparison with a
previous approach to the problem of multidimensional community discovery.

2 Multidimensional networks 

In the world as we know it we can see a large number of interactions and 
connections among information sources, events, people, or items, giving birth 
to complex networks. Enumerating all the possible networks detectable within 
our world, or their properties, would be difficult due to their number and heterogeneity,
and it is beyond the scope of this paper. An excellent survey on complex
networks can be found in [27], where the author gives a good classification of
networks into social (where, for example, we find online social networks such as
Facebook), information (such as citation networks), technological
(among which we mention the power grid, the train routes, or the Internet),
and biological (e.g., protein interaction networks) networks.
(a) small DBLP extract (b) small Query Log extract



Fig. 3 Small extracts from the multidimensional DBLP and Query Log networks. Edge 
colors correspond to dimensions in the network, i.e. distinct conferences for DBLP, and 
binned ranks for Query Log. 



While all the example networks presented in [27] are monodimensional, in
the real world it is possible to find many multidimensional networks: transportation
networks (transport means are different dimensions), social networks
(different online services may be seen as different dimensions connecting the
same users), co-authorship networks (different venues as dimensions), constitute
a short, non-exhaustive list of possible real-world examples.



2.1 A model for multidimensional networks 



In its classical definition, a network is defined as a structure that is made up of 
a set of entities and connections among them. We want to extend this definition 
by allowing connections of different kinds, that we call dimensions. We want 
to emphasize that each dimension corresponds to a different perspective of the 
network connectivity structure. 

We use a multigraph to model a multidimensional network and its properties.
For the sake of simplicity, in our model we only consider undirected
multigraphs and, since we do not consider node labels, hereafter we use edge-labeled
undirected multigraphs, denoted by a triple $\mathcal{G} = (V, E, L)$ where: $V$ is
a set of nodes; $L$ is a set of labels; $E$ is a set of labeled edges, i.e., the set of
triples $(u, v, d)$ where $u, v \in V$ are nodes and $d \in L$ is a label. Also, we use the
term dimension to indicate a label, and we say that a node belongs to or appears
in a given dimension $d$ if there is at least one edge labeled with $d$ adjacent to
it. We also say that an edge belongs to or appears in a dimension $d$ if its label
is $d$. We assume that, given a pair of nodes $u, v \in V$ and a label $d \in L$, only
one edge $(u, v, d)$ may exist. Thus, each pair of nodes in $\mathcal{G}$ can be connected
by at most $|L|$ possible edges.
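
The model above can be illustrated with a small data structure. This is only a minimal sketch: the class name `MultidimensionalNetwork` and its method names are hypothetical and do not come from the paper. Storing each edge as an unordered pair plus a label is one simple way to guarantee that at most one edge exists per pair of nodes and dimension.

```python
class MultidimensionalNetwork:
    """Edge-labeled undirected multigraph (V, E, L); all names are illustrative."""

    def __init__(self):
        self.nodes = set()    # V
        self.labels = set()   # L, the dimensions
        self.edges = set()    # E, stored as (frozenset({u, v}), d)

    def add_edge(self, u, v, d):
        # A set of (unordered pair, label) entries keeps at most one edge
        # per pair of nodes and dimension, as the model assumes.
        self.nodes.update((u, v))
        self.labels.add(d)
        self.edges.add((frozenset((u, v)), d))

    def dimensions_between(self, u, v):
        """Labels of the edges connecting u and v (at most |L| of them)."""
        pair = frozenset((u, v))
        return {d for p, d in self.edges if p == pair}

    def dimensions_of(self, node):
        """Dimensions in which the node appears (has at least one edge)."""
        return {d for pair, d in self.edges if node in pair}


net = MultidimensionalNetwork()
net.add_edge("u", "v", "KDD")
net.add_edge("v", "u", "KDD")    # same undirected edge, stored only once
net.add_edge("u", "v", "VLDB")
net.add_edge("v", "w", "VLDB")
```

In this toy instance, nodes u and v are connected in two dimensions, while w appears only in VLDB.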
2.2 Real world dataset

We created two multidimensional networks from the well known digital bibliography
database DBLP and from a search engine query log.

– DBLP. We extracted author-author relationships if two authors collaborated
in writing at least one paper. The dimensions of this network are
defined as the venues in which the papers were published, resulting in 2,536
conferences that took place in the years 2000-2010 (all the editions of a conference
are considered as one dimension). We weighted each edge by the
number of papers published by the two connected authors in the same
conference (dimension). The final network consisted of 558,800 nodes, connected
by 2,668,497 edges in 2,536 dimensions. A small extract of this
network is represented in Figure 3(a). Figure 4(a) reports the distribution
of the number of edges per dimension (the dimensions are sorted by the
values of the y axis). A high number of edges corresponds to a high number of
editions of a conference and/or a high number of published papers and/or
a high number of co-authors per paper.

– QueryLog. This network was constructed from a query log of approximately
20 million web-search queries submitted by 650,000 users, as described
in [30]. We extracted a word-word network of query terms (nodes),
connecting two words if they appeared together in a query. The dimensions
are defined as the rank positions of the clicked results, grouped into five almost
equi-populated bins: "Bin1" for rank 1, "Bin2" for ranks 2-3, "Bin3"
for ranks 4-6, "Bin4" for ranks 7-10, "Bin5" for ranks 11-500. Hence, two
words that appeared together in a query for which the user clicked on a resulting
URL ranked #4 produce a link in dimension "Bin3" between the two words.
We weighted each edge by the number of queries in which the two connected
words appeared together in the same dimension. The final network
consisted of 131,268 nodes, connected by 2,313,224 edges in 5 dimensions.

^ http://dblp.uni-trier.de

^ http://www.gregsadetsky.com/aol-data
Fig. 4 Distribution of the number of edges in each dimension in DBLP and Query Log.
Rank of the dimensions based on number of edges.

A small extract of this network is represented in Figure 3(b). Figure 4(b)
reports the distribution of the number of edges per dimension.
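
The construction of the Query Log network described above can be sketched as follows. This is a hedged illustration, not the authors' code: the function names `rank_to_bin` and `build_edges`, the whitespace tokenization, the input format (pairs of query string and clicked rank), and the choice to drop ranks outside 1-500 are all assumptions made for the example. Only the bin boundaries are taken from the text.

```python
from collections import Counter
from itertools import combinations

# Bin boundaries as listed in the text: (lowest rank, highest rank, dimension).
BINS = [(1, 1, "Bin1"), (2, 3, "Bin2"), (4, 6, "Bin3"),
        (7, 10, "Bin4"), (11, 500, "Bin5")]


def rank_to_bin(rank):
    """Map the rank of a clicked result to its dimension, or None if out of range."""
    for lowest, highest, name in BINS:
        if lowest <= rank <= highest:
            return name
    return None


def build_edges(clicks):
    """clicks: iterable of (query string, clicked rank) pairs (assumed format).

    Returns a Counter mapping (word_u, word_v, dimension) to the number of
    queries in which the two words appeared together in that dimension.
    """
    weights = Counter()
    for query, rank in clicks:
        dimension = rank_to_bin(rank)
        if dimension is None:
            continue  # assumption: ranks outside 1-500 are ignored
        words = sorted(set(query.split()))  # assumption: whitespace tokens
        for u, v in combinations(words, 2):
            weights[(u, v, dimension)] += 1
    return weights


example = build_edges([
    ("cheap flights rome", 4),
    ("flights rome", 1),
    ("rome flights", 4),
    ("rome hotels", 700),
])
```

Sorting the words of each query gives every undirected pair a single canonical key, so that "flights rome" and "rome flights" contribute to the same edge.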

Following the classification in [27], we took one social and one information
network, with different features (semantics, number of dimensions, number of
nodes, types of dimensions, i.e., categorical vs. numerical attributes). Although
they do not cover the entire space of possible networks, their different features
are representative of several different real world networks [27].

3 The ABACUS framework 

In this section we present the core theoretical concepts of our problem. After 
defining the types of communities we are seeking, we introduce an apriori-based 
algorithm for their extraction. We show how well the apriori strategy suits the
search for our kind of patterns, with the help of a run-through example on a
toy dataset.

3.1 A new concept 

As said above, most of the existing approaches to the problem of community 
discovery rely on a concept of community which is structure-based. That is, 
nodes with dense connections (or high interaction) are grouped together (in 
some cases, overlapping communities are also discovered). In this paper, we 
change this perspective. Let us start with a real-world example. In the WWW 
context, nowadays it is very popular to be connected in services like Facebook, 
Twitter, Google+ and, possibly, all of them. Each of these services sees different
communities that can be spotted within its set of nodes. Since today many
users have their online identities replicated across the different social
networks, it is very likely that people sharing their membership in community
k in service s are also sharing their membership in community k' in service
s'. Extending this, we can easily imagine that many communities (especially
small ones) would be exactly replicated across different dimensions.

In addition to this, there is another effect that can be detected in the real 
world. Even within close circles of friends, it is common to see pairs of
people who are not directly connected. There can be many reasons for this:
they can be enemies, or potential friends not yet connected, or there can be
obstacles to establishing their connection (in some example networks, spatial
constraints may inhibit people living too far away from connecting to each
other). Yet, in these cases, two or more persons can share their memberships
to communities in different contexts, or social networks, or, more generally,
dimensions.

Two nodes A and B can then end up being logically connected by their
shared memberships (say, to community 3 in dimension Google+ and to community
4 in dimension Facebook), but never actually connected in any of the dimensions
in which they appear. This concept of logical connection is
crucial here. Previous monodimensional community discovery
approaches have a limited view of the rich set of connections among
nodes, and disregarding the additional information provided by multiple dimensions
would be very limiting. Let us consider a co-authorship graph in DBLP,
where each conference is a different dimension. Two persons in such a network
can easily be found to have connections in conferences such as KDD, VLDB,
and SDM, while they are not connected, or not even present, in other dimensions
such as AAAI, or SIGGRAPH, and so on. This piece of information is
usually lost in traditional algorithms working on monodimensional networks,
and, unfortunately, weights do not help in fully conveying this additional
knowledge.

On the other hand, if we use the shared memberships as the key concept for
connecting people (who are thus not necessarily directly connected), we are linking
them logically, using the semantics residing in the dimensions.



3.2 From communities to itemsets 

Following the above idea, we can proceed in our extraction as follows. First, 
we can split a multidimensional network into several monodimensional ones. 
We can then perform any existing technique for monodimensional community 
discovery, obtaining, for each node of the original network, a set of memberships
to communities in each single dimension. We now use the nodes
as transactions of items, where an item is a pair (dimension, community)
expressing the membership of the node in a given dimension. At this point,
applying the apriori strategy to find frequent closed itemsets [1] appears to
be natural. There is, in fact, a natural mapping of almost all the concepts of
the frequent itemset mining paradigm onto our problem: nodes are transactions;
memberships are items; multidimensional communities are itemsets; the support
of an itemset is the number of nodes sharing that set of memberships,
and so on. Even the constraint-based paradigm has a role in our problem:
one can, in fact, use constraints on the itemsets (e.g., excluding/including specific
items, computing any monotonic or convertible measure on itemsets, and
so on). For the sake of simplicity, we reserve this part of the problem for future
work, and we focus only on the extraction of frequent closed itemsets [4]. In
this new domain, it is also necessary to define some terms for a common understanding.
With the term support, we mean the number of nodes that are
members of a given multidimensional community. For instance, in the case of a
co-authorship multidimensional network, a multidimensional community whose
members are two different authors has support two. Moreover,
the size represents the number of different dimensions involved in a multidimensional
community. Again in the multidimensional co-authorship network,
a multidimensional community composed of two dimensions,
such as two conferences, has size two.

Before explaining why we chose to extract closed itemsets, let us follow a
run-through example of our search strategy. Figure 5 describes our toy input
Fig. 5 Run-through example: co-authorship network with two dimensions: KDD and VLDB
(top), monodimensional overlapping communities (bottom)



network (top), consisting of five nodes, connected in two different dimensions
(KDD and VLDB). From the top image to the ones below, we perform two
steps: first, we split the multidimensional network into two monodimensional
ones; then, we perform the community discovery on each of them. The algorithm
finds two different communities (highlighted by dashed lines of different
colors) in each of the dimensions. The output of this process is represented on
the left of Figure 6, which shows the list of transactions that it is possible to
build from the memberships of the five nodes. The right part of Figure 6 then shows
how the lattice of multidimensional communities is created. In this lattice,
we imagine having also a third dimension, namely PKDD, with a single
community with a single person in it, which, clearly, gets cut by a minimum
support threshold $\sigma = 2$. In bold black we have the frequent itemsets, while
in red we highlight the closed frequent ones. Finally, we see that the closed
frequent itemsets clearly summarize the entire set of frequent itemsets found,
so it would be redundant to return also non-closed itemsets, as they would be
sub-patterns of richer patterns, in terms of semantics of the involved items.
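
The run-through example can be reproduced with a few lines of code. The sketch below is a plain level-wise (Apriori-style) search followed by a closedness filter, run on the five transactions listed in Figure 6 with a minimum support of 2. It is an illustration under assumptions, not the implementation used in the experiments: the function names are hypothetical, the letters are treated as opaque item labels, and the imagined PKDD item is omitted because it does not appear in the transaction list.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search; returns {itemset: set of supporting transaction ids}."""
    level = {}
    for tid, items in transactions.items():
        for item in items:
            level.setdefault(frozenset([item]), set()).add(tid)
    level = {s: t for s, t in level.items() if len(t) >= min_support}
    result = dict(level)
    while level:
        next_level = {}
        for a, b in combinations(list(level), 2):
            candidate = a | b
            if len(candidate) != len(a) + 1 or candidate in next_level:
                continue
            # Apriori pruning: every subset one item smaller must be frequent.
            if any(candidate - {x} not in level for x in candidate):
                continue
            tids = level[a] & level[b]
            if len(tids) >= min_support:
                next_level[candidate] = tids
        result.update(next_level)
        level = next_level
    return result


def closed_only(frequent):
    """Keep the itemsets that have no proper superset with the same support."""
    return {
        s: tids for s, tids in frequent.items()
        if not any(s < other and len(tids) == len(frequent[other])
                   for other in frequent)
    }


# Transactions as listed in Figure 6 (node id -> items).
toy = {1: "ABC", 2: "BCE", 3: "ABCE", 4: "BE", 5: "ABCE"}
frequent = frequent_itemsets(toy, min_support=2)
closed = closed_only(frequent)
for itemset, tids in sorted(closed.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), "support:", len(tids), "nodes:", sorted(tids))
```

On this input all 15 non-empty itemsets over {A, B, C, E} are frequent, but only six are closed: B, BC, BE, ABC, BCE and ABCE. Each closed itemset comes with its supporting transaction ids, i.e., the nodes of the corresponding multidimensional community.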



3.3 ABACUS : an Apriori-BAsed Community discoverer in multidimensional 
networks 

Algorithm 1 is the core of our approach. It takes as input three parameters:
the multidimensional network $\mathcal{G}$, a monodimensional algorithm for community
TID  ITEMS
1    ABC
2    BCE
3    ABCE
4    BE
5    ABCE


Fig. 6 Run-through example: set of monodimensional communities (items) associated to 
each node (transaction), on the left; lattice of multidimensional communities extracted using 
apriori algorithm, on the right. 



discovery CD, and a minimum support threshold $\sigma$. The algorithm works
by building a set of transactions, memberships, that, for each node n, records
a set of pairs (i, j) representing the membership of node n to community j in
dimension i. Note that if CD is able to find overlapping communities, one node
could have more than one pair associated to a specific dimension. However,
for the sake of simplicity, and without loss of generality, in the rest of the paper
we assume to work with a non-overlapping community discovery algorithm.

In line 4 the function $\phi$ is called to split the multidimensional network
into a set of monodimensional ones, by replicating each node into each of the
dimensions in which it has at least one edge, and adding to it all of its adjacent
edges in their corresponding dimensions. Each dimension is then processed as
a separate network G by CD, returning a different set of communities per
dimension. In lines 7-10, for each node in each community, its memberships
are updated with the pair (dimension, community), building a set of
transactions (one per node). Such set is then passed to apriori in line 13, together
with a threshold of minimum support, and the resulting set of frequent
itemsets and their corresponding transaction ids are returned: the former constituting
the multidimensional description of each community (as a set of pairs
(dimension, community)), the latter constituting the set of nodes contained
in each community (i.e., the ids of the transactions supporting the frequent
itemset).

The complexity of ABACUS is directly inherited from the complexity of
the apriori algorithm used, and from that of the method for monodimensional
community discovery. The additional complexity introduced by ABACUS, in
Algorithm 1 ABACUS

Require: $\mathcal{G}$, $CD$, $\sigma$

1: for all $n \in nodes(\mathcal{G})$ do
2:   $memberships[n] = \emptyset$
3: end for
4: $\{G\} \leftarrow \phi(\mathcal{G})$
5: for all $G_d \in \{G\}$ do
6:   $\{c\} \leftarrow CD(G_d)$
7:   for all $c_j \in \{c\}$ do
8:     for all $n \in nodes(c_j)$ do
9:       $memberships[n] \leftarrow memberships[n] \cup \{(d, j)\}$
10:    end for
11:  end for
12: end for
13: $(\{i\_set\}, \{t\_id\}) \leftarrow apriori(memberships, \sigma)$
14: return $(\{i\_set\}, \{t\_id\})$



fact, resides only in the problem-mapping phase, where we perform a linear
scan of the list of communities found and we prepare the input for apriori.
We then refer to the corresponding papers for discussion on the complexity, 
although in Section 4.3 we present an empirical evaluation of the complexity 
of ABACUS. 
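
The problem-mapping phase of Algorithm 1 (lines 1-12) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and connected components are used as a deliberately simple stand-in for the CD parameter, not the community discovery algorithm used in the experiments. The resulting transactions can then be passed to any frequent closed itemset miner.

```python
from collections import defaultdict


def split_by_dimension(edges):
    """Role of phi: one adjacency map per dimension from triples (u, v, d)."""
    networks = defaultdict(lambda: defaultdict(set))
    for u, v, d in edges:
        networks[d][u].add(v)
        networks[d][v].add(u)
    return networks


def connected_components(adjacency):
    """Stand-in CD (an assumption): each connected component is a community."""
    seen, communities = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        communities.append(component)
    return communities


def build_memberships(edges, cd=connected_components):
    """Label each node with its set of (dimension, community id) pairs."""
    networks = split_by_dimension(edges)
    memberships = defaultdict(set)
    for d in sorted(networks):
        for j, community in enumerate(cd(networks[d])):
            for node in community:
                memberships[node].add((d, j))
    return dict(memberships)


# Nodes a and c are never directly connected, yet share both memberships.
edges = [("a", "b", "KDD"), ("b", "c", "KDD"),
         ("a", "x", "VLDB"), ("x", "c", "VLDB")]
memberships = build_memberships(edges)
```

In this toy input, a and c end up with the same pair of items, (KDD, 0) and (VLDB, 0), although no edge links them in any dimension. With a minimum support of 2 they would therefore form a multidimensional community that is not connected.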



4 Case study on DBLP and Query Log 

4.1 Tools 

We have implemented ABACUS in C++, making use of the igraph library.

As the CD parameter, we use the community discovery algorithm based on
label propagation [33]. This algorithm is well known to be scalable, and, as
a result, our running times to process the network were low (a
few seconds up to the creation of the transaction file, plus a few minutes to
perform apriori, see Section 4.3 for running times). In all the experiments we
set the minimum support threshold to 2, in order to capture all the possible
connections among nodes.

We chose eclat [17] as an efficient frequent itemset miner for the apriori step,
since it returns the list of supporting transactions for every itemset.

Note that many other choices are possible for the CD and the apriori steps 
and that, for the sake of simplicity and presentation, we only report the results
obtained by the above choice. We leave for further research the investigation 
on the sensitivity of our approach to the choice of the implementation of these 
two steps. Note also that while the choice for the apriori implementation is 
usually mainly driven by scalability issues, selecting a different algorithm for 
community discovery may lead to very different communities. The debate on 
which algorithm to choose is however out of scope in this paper, and we refer 



^ http://igraph.sourceforge.net 
to Section 5 and to the surveys on community discovery to guide the reader
to the best choice for this step, which is mainly driven by the final application
[18,23].

All the experiments were performed on a laptop equipped with an Intel i7
processor at 2.2GHz, with 4GB of RAM. 



4.2 Experiments 

We performed our experiments following four questions related to our problem: 

Q1. Quantitative evaluation: can we spot regularities and anomalies in the solution
space of the apriori algorithm? Can we measure and identify such
anomalies?

Q2. Selection of results: given the high number of resulting communities, is 
there any easy (to compute) and meaningful way to reduce the patterns, 
possibly during the search step of the apriori? 

Q3. Making sense of the density: since we do not rely necessarily on connect- 
edness, can we identify relational dependencies between our concept of 
communities and structural properties of them? 

Q4. Qualitative evaluation: among the communities found, are there any relevant
ones? Can we reason on the multidimensional density of the connections 
within the communities? 

In order to answer the above questions, we define a simple and easy to
compute measure of connectedness within communities. The Multidimensional
Community Density (MCD) is then the number of edges in a community






Fig. 7 Cumulative distributions of Multidimensional Community Density (MCD) for the 
two networks 
normalized by the maximum possible for that community, or, in formula:

$$\frac{\#edges}{ndim \times \#nodes \times (\#nodes - 1)}$$

where ndim is the number of different dimensions found in the community.
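
As a hedged illustration, the density above can be computed with a small helper. The function name `mcd` and its parameters are hypothetical. The default follows the formula literally; the text does not say whether #edges counts each undirected edge once or as two ordered pairs, so the optional flag `halve_denominator` (an assumption, not part of the paper) covers the reading in which each undirected edge is counted once and the maximum per dimension is #nodes × (#nodes − 1) / 2.

```python
def mcd(n_edges, n_dims, n_nodes, halve_denominator=False):
    """Multidimensional Community Density of one community.

    n_edges: number of edges among the community's nodes, over all of its
             dimensions (counting convention left to the caller, see above).
    n_dims:  number of different dimensions found in the community.
    n_nodes: number of nodes in the community.
    """
    if n_dims < 1 or n_nodes < 2:
        raise ValueError("need at least one dimension and two nodes")
    max_edges = n_dims * n_nodes * (n_nodes - 1)
    if halve_denominator:
        # Assumption: each undirected pair of nodes is counted only once.
        max_edges = max_edges / 2
    return n_edges / max_edges


# Four nodes fully connected in each of four dimensions: 6 undirected
# edges per dimension, i.e. 24 edges, or 48 if counted as ordered pairs.
print(mcd(48, 4, 4))
print(mcd(24, 4, 4, halve_denominator=True))
```

Both calls describe the same fully connected community under the two counting conventions and yield a density of 1.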

By applying the ABACUS algorithm to the DBLP multidimensional network,
we obtained 484,833 multidimensional communities of size at least
2, while we found 14,415 communities in Query Log (the large imbalance is
due to the large imbalance in the number of dimensions of the two datasets).
We address later in the paper the problem of filtering this large number of
resulting communities.

Figure 7 shows the cumulative distributions of MCD on the resulting communities.
As we can see, the line corresponding to Query Log is globally under
the one corresponding to DBLP. One possible explanation for this is the higher
number of edges per dimension in Query Log (see Figure 4).

The distributions of the support (MCS, number of nodes in a community)
and the size of the patterns (see Figure 8, top row for MCS, bottom row for the
size) were in line with the literature of the applications of Frequent Pattern
Mining. MCS ranged from 2 to 216 for DBLP, and from 2 to 70,303 for Query 
Log. The size ranged from 1 to 63 for DBLP, and from 1 to 5 for Query Log 
(note however that we are not very interested in the results with size equal to 
1, as they are truly monodimensional communities). 

Now, we want to answer Q2. The Frequent Pattern Mining literature reports
that the problem of finding few relevant patterns to be interpreted,
among the many returned, is hard pT| [2T ]. We can overcome this problem
in three different ways. First, we can look at the distributions of the MCD
(defined above), the support of the patterns, and the size of the itemsets to
focus our search towards the communities that we consider relevant, depending
on the final application. Figures 7 and 8 report the mentioned distributions
(we report the cumulative versions, to be able to use the three measures as
straightforward filters). For better comparison, we reported on the y-axes the
percentage of communities with values of the measures greater than a certain
threshold. However, the absolute number of communities can be used to
choose, depending on the application, the best support, size of the itemset,
and MCD to select only the relevant communities. Second, more generally
speaking, the entire Constraint-Based Frequent Pattern Mining literature can
be applied in our scenario at run time, to drive the search to fewer, more
focused, patterns [32"'3U[9 . For example, we may want only patterns including
or excluding a specific dimension, or patterns including dimensions with specific
properties (e.g., at least 1000 authors). To this extent, it is worth noting
that MCD is neither (anti-)monotone, nor convertible, nor loose-antimonotone.
We leave for further research the definition of meaningful, application-driven
constraints, and their effect on the results. Lastly, the authors of [11] present
another methodology for selecting few interesting patterns among many, which
is not based on constraints. We believe that this technique may also be used,
and we plan to investigate this opportunity in the future.
Fig. 8 Cumulative distribution of support (top row) and size of the itemsets (bottom row),
for DBLP (left column) and Query Log (right column)



To answer Q3 we check whether MCD is correlated with other
structural properties of the nodes of the communities. For example, one possible
intuition is that communities with low density may group together nodes
that were at the borders of the monodimensional communities. To study this,
we computed the closeness centrality for each node and for each dimension,
and checked the correlation between the centrality and the density. We did not
find any clear sign of direct correlation. We also checked for correlation with
PageRank, the degree centrality and the betweenness centrality, for which again
we did not find signs of correlation. In future studies, we will investigate the
relationships between MCD and other measures, or other kinds of patterns (e.g.,
frequent subgraphs).

Lastly, in order to answer Q4, we extracted a few communities either
minimizing or maximising MCD. As stated above, we can use the distributions
of size, support and MCD to post-process the results and keep only the few
interesting ones. We extracted a few (i.e., 200) communities for each
network, and we report four of them in Figure 9. Apart from the first example,
which was found by searching for one of the co-authors of this paper, the others
were found by examining the results filtered by means of the three measures
mentioned above. In particular: Figure 9(b) was found within 260 communities
obtained by constraining MCD < 0.1, MCS > 2 and size > 3; Figure 9(c)
was found among 287 communities obtained by constraining MCD = 1,
MCS > 3 and size > 2; Figure 9(d) was found within 286 communities
obtained by constraining MCD < 0.5, MCS > 4 and size > 4. These thresholds
were obtained by looking at the distributions reported above.
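
The post-processing described above amounts to a simple threshold filter. A minimal Python sketch follows, assuming each extracted community has already been annotated with its MCD, MCS and size. The record layout, the function name `filter_communities` and the sample values are hypothetical and only illustrate the three constraint sets quoted in the text.

```python
def filter_communities(records, mcd_ok, min_mcs, min_size):
    """Keep records whose MCD satisfies `mcd_ok` and whose MCS and size
    are strictly greater than the given bounds."""
    return [r for r in records
            if mcd_ok(r["mcd"]) and r["mcs"] > min_mcs and r["size"] > min_size]


# Hypothetical annotated communities; the values are made up for illustration.
records = [
    {"id": "c1", "mcd": 0.075, "mcs": 3, "size": 4},
    {"id": "c2", "mcd": 1.0, "mcs": 4, "size": 3},
    {"id": "c3", "mcd": 0.33, "mcs": 5, "size": 5},
    {"id": "c4", "mcd": 0.6, "mcs": 1, "size": 2},
]

# The three constraint sets mentioned in the text.
low_density = filter_communities(records, lambda d: d < 0.1, min_mcs=2, min_size=3)
full_density = filter_communities(records, lambda d: d == 1, min_mcs=3, min_size=2)
mid_density = filter_communities(records, lambda d: d < 0.5, min_mcs=4, min_size=4)
```

Passing the MCD condition as a predicate keeps the same function usable for both the "minimizing" and the "maximising" selections.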

Consider the one in Figure 9(a). We discovered a size-4 community connecting
FP, FG, MN and DP with dimensions set {KDD, CIS, SAC, SEED}. It is
interesting to observe that, given its very dense connections, this
multidimensional community would also have been found by using the method
proposed in [6].



[Figure 9 panels: (a) MCD=1 in DBLP; (b) MCD=0.075 in DBLP; (c) MCD=1 in Query Log; (d) MCD=0.33 in Query Log]



Fig. 9 Four communities with high and low MCD extracted from the two networks. Nodes
in (a): Fosca Giannotti, Mirco Nanni, Dino Pedreschi, Fabio Pinelli. Nodes in (b): Amit
Agarwal, Qikai Chen, Swaroop Ghosh, Patrick Ndai, Kaushik Roy. The colored dashed ovals
represent the shared memberships to the same community in the corresponding dimensions
(see circle labels). The dashed anonymous nodes in (b) represent several nodes belonging
to the communities in dimensions IOLTS, ISLPED and DATE; they are not shown, to
improve readability.
However, the method proposed in this paper can discover more complex
interactions between dimensions. Indeed, the lattice can be used to browse the
multidimensional communities by selecting different sets of dimensions. To give
an example, we extracted a size-3 community composed of authors AA, PN, QC, KD
and SG with connections in dimensions set {IOLTS, DATE, ISLPED}; see
Figure 9(b), where the different multidimensional memberships are shown. These
authors are part of three monodimensional communities, but have not co-authored
papers at these three conferences (there are no links connecting them). By
adding the dimension ICCD (in red), we are able to extract a size-4 community
composed of the first three authors. This fourth dimension includes a paper
co-authored by the three authors, which resulted in an ICCD-monodimensional
community formed by the three nodes. Interestingly, through this dimension we
are able to specialize the previously discovered five-author community. Note
that using the method proposed in [6] it would not be possible to discover how
the ICCD dimension could specialize the community, and so its semantic meaning.
This is because our results include more information than those of the
mentioned work.
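
The specialization step can be illustrated with a small Python sketch. It assumes non-overlapping monodimensional communities, so that each node has at most one community identifier per dimension. The function `members`, the node labels `u1`..`u5` and the dimension labels `d1`..`d4` are hypothetical; the toy data only mirrors the shape of the example (five nodes sharing three dimensions, three of which also share a fourth).

```python
def members(memberships, pattern):
    """Return the nodes whose monodimensional memberships match `pattern`.

    memberships: node -> {dimension: community id in that dimension}
    pattern:     {dimension: community id}, one entry per selected dimension
    """
    return {node for node, m in memberships.items()
            if all(m.get(dim) == cid for dim, cid in pattern.items())}


# Toy data (hypothetical): five nodes share community 0 in d1, d2 and d3,
# but only the first three also share a community in d4.
memberships = {
    "u1": {"d1": 0, "d2": 0, "d3": 0, "d4": 0},
    "u2": {"d1": 0, "d2": 0, "d3": 0, "d4": 0},
    "u3": {"d1": 0, "d2": 0, "d3": 0, "d4": 0},
    "u4": {"d1": 0, "d2": 0, "d3": 0},
    "u5": {"d1": 0, "d2": 0, "d3": 0, "d4": 1},
}

general = members(memberships, {"d1": 0, "d2": 0, "d3": 0})
specialized = members(memberships, {"d1": 0, "d2": 0, "d3": 0, "d4": 0})
```

Adding a dimension to the pattern can only shrink the node set, which is what makes the lattice navigable from general to specialized communities.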

Similar results can be obtained by applying ABACUS to the Query Log dataset.
As shown in Figure 9(c), by maximising the MCD we were able to detect a highly
connected multidimensional community where the words Machu, Picchu and
unbelievable are connected in three different dimensions of the dataset. As for
the DBLP dataset, ABACUS was also able to find a more complex multidimensional
community on the Query Log dataset, shown in Figure 9(d). In this example,
if we consider all the dimensions, we obtain a set of words belonging to the
same multidimensional community with a strong intrinsic semantic correlation
(i.e., Pablo, Picasso, Neruda: besides sharing their first name, there exists
an edition of a book by Neruda with a Picasso painting on the cover). If we
then remove the most specific dimension (Bin 1, i.e., clicks on the first
returned result), we include vaguer words. Also in this case, the method
proposed in [6] does not allow one to investigate the effect of the different
dimensions on the specialization of the communities and, thus, the intrinsic
semantic correlation among different words.



4.3 Comparison with previous approaches 

In [6], the authors proposed another way to extract multidimensional
communities. Their approach is, however, based on a different concept of
community: a multidimensional community groups nodes that are highly
multidimensionally connected. How this multidimensional connectedness is
evaluated is left to the end of the process, by post-processing the resulting
communities. Their approach is composed of the following steps: first, the
multidimensional network is collapsed to a monodimensional one (i.e., they
follow exactly the opposite of our first step), by weighing the edges in
different ways; second, monodimensional community discovery is performed on the
resulting network; on the resulting communities, multidimensional connections
are restored from the original networks; the communities are then evaluated by
means of multidimensional measures.

Fig. 10 Run-through example: Monodimensional community discovery applied to a
collapsed multidimensional network, where edges are weighted by the number of
dimensions.

Applying their strategy to our example, we would collapse the multidimensional
network of our run-through example into a weighted monodimensional network,
as depicted in Figure 10. In this example, a CD algorithm will find only one
community containing all the nodes. Without a manual postprocessing step
(e.g., reintroducing all the edges, or, equivalently, relabeling the edges to
represent the original multidimensional information) it would be impossible to
find the subcommunity containing only the nodes 1, 3, and 5, which instead
is automatically detected using our method.
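
The collapsing step of the baseline can be sketched in a few lines of Python. This is only an illustration of the "weight by number of dimensions" variant, assuming an undirected network given as (node, node, dimension) triples; the function name `collapse` and the toy edge list are hypothetical and do not reproduce the run-through example.

```python
from collections import defaultdict


def collapse(edges):
    """Collapse (u, v, dimension) edges into {frozenset({u, v}): weight},
    where the weight is the number of distinct dimensions linking u and v."""
    dims = defaultdict(set)
    for u, v, dim in edges:
        # frozenset makes (u, v) and (v, u) the same undirected pair.
        dims[frozenset((u, v))].add(dim)
    return {pair: len(found) for pair, found in dims.items()}


# Hypothetical toy network with three dimensions A, B and C.
edges = [(1, 2, "A"), (1, 2, "B"), (2, 3, "A"), (1, 3, "C"), (3, 1, "A")]
weights = collapse(edges)
```

Note that the dimension labels are no longer present in `weights`, which is why the multidimensional information has to be restored in a later step.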

At this stage, we wanted to compare the two approaches at different levels. 
In particular we wanted to answer the following: 

Q5. Quantitative evaluation: how do the two sets of communities found
compare? Can we measure their intersection and the number of communities
that only one of the two approaches may find?

Q6. Qualitative evaluation: what do the two different concepts of community 
look like? 

Q7. Scalability: how do the two methods perform on networks of different size? 

In order to address the above, we ran ABACUS and the algorithm proposed in
[6] (from now on, the "baseline") on several subsets of the DBLP dataset. We
took incrementally larger subsets of DBLP^, consisting of all the nodes, edges
and dimensions contained in the single year 2010, the years 2009 and 2010, the
years between 2008 and 2010, and so on, up to the years from 2000 to 2010.
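
The construction of these nested subsets can be sketched as follows. This is a minimal Python illustration, assuming each edge is stored as a (node, node, dimension, year) tuple; the tuple layout, the function name `incremental_subsets` and the toy edges are hypothetical.

```python
def incremental_subsets(edges, last_year, first_year):
    """Yield (start_year, edges) pairs for the windows
    [last_year, last_year], [last_year - 1, last_year], ..., [first_year, last_year]."""
    for start in range(last_year, first_year - 1, -1):
        yield start, [e for e in edges if start <= e[3] <= last_year]


# Hypothetical toy edges: (author, author, conference, year).
edges = [
    ("a", "b", "KDD", 2010),
    ("a", "c", "ICDM", 2009),
    ("b", "c", "KDD", 2008),
]
subsets = dict(incremental_subsets(edges, last_year=2010, first_year=2008))
```

Each window contains the previous one, so nodes, edges and dimensions can only be added as the starting year moves back.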

4.3.1 Quantitative evaluation 



Figure 11(a) reports the number of communities found by the two methods.
As we see, due to the strategy of collapsing the multidimensional network to 
a monodimensional one, the number of communities found by the baseline 
becomes nearly stable after adding four years. In fact, after the first step, 
each additional year included into the subset is only changing the weight of 
existing edges, instead of creating new ones (and bringing new nodes). On the 
other hand, the search space of ABACUS grows consistently up to the last 



^ We did not test the comparison on Query Log, for the sake of simplicity.
Fig. 11 Quantitative comparisons between ABACUS and the baseline (each x value
corresponds to an additional year included in the subset, from 2010 to 2000): (a) shows the
number of communities found by the two methods; (b) shows the intersection of the sets of
communities found (given A as the set of communities found by the baseline and B the set
returned by ABACUS, the red bar shows $|A \setminus B| / |A \cup B|$, i.e. the portion of
communities found only by the baseline; the green bar shows $|B \setminus A| / |A \cup B|$,
i.e. the portion of communities found only by ABACUS; the blue bar shows
$|A \cap B| / |A \cup B|$, i.e. the portion of communities found by both).



two or three steps, where the growth slows down. By keeping the dimensions 
separated, in fact, each additional year is able to provide a significant number 
of new combinations to the previous ones. Although the number of results 
returned by ABACUS is high, we have discussed in Q2 and Q2 how to deal 
with it. 

However, a few clear questions arise: are the two methods finding the same
communities? Is one method returning communities also found by its competitor?
Can we identify (classes of) communities that can be found only by
one of the two methods? Figure 11(b) partially answers these questions from a
quantitative point of view. Calling A the set of communities found by the
baseline and B the set returned by ABACUS, the red bar shows
$|A \setminus B| / |A \cup B|$, i.e. the portion of communities found only by the
baseline; the green bar shows $|B \setminus A| / |A \cup B|$, i.e. the portion of
communities found only by ABACUS; the blue bar shows $|A \cap B| / |A \cup B|$,
i.e. the portion of communities found by both. Note that in order to compare the
communities found we had to remove the multidimensional information contained in
those found by ABACUS. This step is nevertheless safe: there cannot be two
instances of the same set of nodes tied to two different sets of dimensions
(itemsets), as this would violate the theory behind closed itemsets. Note also
that, in analogy with the majority of the works on community discovery and on
frequent pattern mining, we perform exact matching here; thus we only count
identical communities in this comparison.
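
The exact-matching comparison can be sketched in Python as follows. The sketch assumes that the three bars are normalized by the number of distinct communities found by either method, so that they sum to one; the function name `overlap_shares` and the toy communities are hypothetical.

```python
def overlap_shares(baseline, abacus):
    """Return the shares (only baseline, only ABACUS, both) of the distinct
    communities found by either method, using exact matching on node sets."""
    a = {frozenset(c) for c in baseline}
    b = {frozenset(c) for c in abacus}
    union = a | b
    if not union:
        raise ValueError("both result sets are empty")
    return (len(a - b) / len(union),
            len(b - a) / len(union),
            len(a & b) / len(union))


# Hypothetical results: communities as plain sets of node identifiers, i.e.
# with the dimension sets of the ABACUS results already removed.
baseline = [{1, 2, 3}, {4, 5}]
abacus = [{1, 2, 3}, {1, 3, 5}, {2, 4}]
only_baseline, only_abacus, both = overlap_shares(baseline, abacus)
```

Converting each community to a `frozenset` makes the comparison independent of node order and lets Python's set operations perform the exact matching.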

As we see, since the bars report relative numbers, the ratio of communities
that can be found only by the baseline decreases as the subset of years
grows. In other words, even though ABACUS is meant to find
communities of a different type than the ones found by the baseline, we see
how, for large datasets, the number of communities found only by the baseline






[Figure 12 panels: (a) community found only by ABACUS; (b) community found only by the baseline]

Fig. 12 Examples of communities found by (a) only ABACUS and (b) only the baseline.
Nodes in (a): Deepak Agarwal, Zhiyuan Chen, Nitin Gupta. Nodes in (b): Anika Awwal,
Matthew Jin, Gul N. Khan, Anita Tinoln. In (b) we also depict the other outgoing edges
from each node that were present in the input data. Due to the nature of the two approaches,
(a) could not be found by the baseline and (b) could not be found by ABACUS, as
different nodes exist in different dimensions.



becomes less relevant. In the following, we answer the above questions also from
a qualitative perspective.



4.3.2 Qualitative evaluation

The two concepts of community found by ABACUS and the baseline are
different, without a clear winner (i.e., they just reflect different types of
interactions among nodes). This can also be seen from the different
classes of communities that only one of the two methods can find. Consider
Figure 3: if that was the entire input, the baseline would collapse the network
into a monodimensional one and possibly find only one community containing
all four nodes. This cannot happen in ABACUS, as the principle
by which nodes are placed in the same multidimensional community is that they
share memberships to monodimensional communities. That is, if Figure 3(a)
was the entire input, ABACUS would find Jon Doe and John Smith in a
multidimensional community, but not the entire set of nodes, as the remaining
two do not share all the memberships with the other nodes (they do not exist in
dimensions ICDM, CIKM and SIGMOD). Figure 12 shows two communities
found during our comparison: (a) was found only by ABACUS, and (b) was
found only by the baseline. Note that we depict all the edges in the original
input, if there were any, and that in (b) we also report the outgoing edges. While
it is clear that (a) cannot be found by the baseline (as it relies on
connectedness, but there are no edges among those nodes in the input), in order
to confirm that (b) could not be found by ABACUS we had to investigate
whether the three nodes were sharing memberships to the same communities
in the depicted dimensions. That is, even if the image shows a community
that could not be detected by ABACUS if the depicted edges were the entire
input data, there might be other edges (and paths) in the data connecting the
nodes. After post-processing the data, we found that this was not the case for
(b), as different nodes are connected in different dimensions (see also the
outgoing edges).



4.3.3 Scalability 

The last part of our comparison regards scalability, and we use this section
also to show the scalability of ABACUS itself. Consider Figure 13. As we see,
even though by adding years we implicitly also add dimensions (not all the
conferences take place in all the years), this has a very low impact on the
running time of ABACUS, and a very high impact on that of the baseline. This
happens even though we go from roughly 400K edges to 2.7M edges for ABACUS,
and from 370K to 1.9M edges for the baseline (the different number of edges
for the same subsets is due to collapsing the edges belonging to different
dimensions). Note that in Figure 13 we report only the running times obtained
with a minimum support of two. That is, we do not test the sensitivity to the
minimum support parameter, as we already give the worst case. In practice, when
looking for larger communities (depending on the application), the running
times may be lower.

To conclude, ABACUS is scalable, and able to process our data in 32 to 
1200 seconds (20 minutes), while the baseline needed 380 (6 minutes) to 13500 
seconds (225 minutes, i.e. almost 4 hours). 



5 Related work 

Fig. 13 Quantitative comparisons between ABACUS and the baseline (each x value
corresponds to an additional year included in the subset, from 2010 to 2000):
running times of the two approaches, in seconds.

Detecting communities in networks has been studied from many angles. Two
comprehensive surveys on the topic can be found in [18, 23]. From one side,
a community has been defined as a set of nodes with a high density of links
among them, and sparse connections to the others. The papers working
with this quantitative definition rely on information theory principles [29] or on
the notion of modularity [15], which is a function defined to detect the ratio
between intra- and inter-community number of edges. Modularity is widely
used in many works, and several algorithms have been proposed to extract
high-modularity partitionings of a network: one of them is a greedy optimization
able to scale up to networks with billions of edges [7]. From another side,
communities have been approached by looking at the statistical properties of
the graph. In [20], a framework for the detection of overlapping communities,
i.e. communities allowing the vertices to be in more than one community, is
presented. The framework is based on the "split betweenness" concept: vertices
and edges are ranked by their betweenness centrality (the fraction of shortest
paths in which they appear) and then split in order to form a transformed
network, where classical algorithms can be used to detect communities. The
resulting communities are then merged in order to find overlaps. Another class
of approaches relies on the propagation in the network of a label [33] or a
particular definition of structure (usually a clique [28]). The first approach
is known for being a quasi-linear solution to the problem, while the second one
allows finding overlapping communities. One algorithm that tries to maximize
quality and quantity measures on its results is InfoMap [34], a random
walk-based algorithm. An emerging novel problem definition can be found in [2],
in which the authors state that community discovery algorithms should group
not nodes but edges, emphasizing the role of the relation residing in a community.
The previously described methods have focused on either unweighted or weighted
graphs, but still consider the network as a monodimensional entity. Only
recently has multidimensionality started to be taken into account in
network analysis. Two examples of such studies are: link prediction in networks
with positive and negative links [22], and a statistical analysis over different
kinds of relations in the same network in an online game community [36].

From a community discovery point of view, to the best of our knowledge,
there are three main approaches that take multiple dimensions into account. In
[26] the authors extend the definition of modularity to fit the multidimensional
case, which they call "multislice". In [37] the authors create a machine
learning procedure which detects the possible different latent dimensions among
the entities in the network and uses them as features for the classification
algorithm. Neither approach considers any definition of "multidimensional
community", nor do they characterize and analyze the communities
found and their multidimensional structure; instead they try to define methods
for dealing with multidimensional networks, but still extract monodimensional
communities as output. Another work dealing with networks containing
heterogeneous information, but not multiple dimensions, is presented in [35],
where the authors propose a method to generate net-clusters using links across
multi-typed objects. In [6], a possible formulation of community discovery and
characterization in multidimensional networks was given. A new measure was
introduced to capture the interplay among the dimensions, which makes
multidimensional communities emerge even where the connections among nodes
reside in different dimensions. In this paper, we approach the problem from a
similar angle, but focus on extracting communities using the apriori algorithm,
and giving a semantic description to each multidimensional community as the subset
of dimensions used to characterize it. The resulting multidimensional
communities can be different from the ones extracted in [6] and are navigable using the
lattice extracted in the frequent item set pattern mining process.

The authors of [12] studied the problem of community mining in multi- 
relational networks. The problem setting, however, is different. The authors 
exploit the multi-relational links to evaluate the importance of the relations 
based on labeled examples, provided by a user as queries. 

The idea of applying closed frequent pattern mining to multi-relational 
data is not new. In [14], the authors extract all closed n-sets satisfying given
piecewise (anti-)monotonic constraints, from n-ary relations. However, they 
solve the technical problem of finding the frequent closed patterns, but do not 
apply this technique to any specific domain. 

There are other works in the literature that deal with the extraction of 
knowledge across networks. In [39] and [41], for example, the authors deal with
the problem of finding cross-graph quasi-cliques. This problem can be seen as 
a sub-problem of the one we deal with in this paper. However, our concept 
of community is independent from the density of the connections among the 
nodes. 



6 Conclusions and future work 

In this paper, we have addressed the problem of multidimensional community 
discovery. We have given a definition of multidimensional community for which 
nodes sharing memberships to the same monodimensional communities in the 
different single dimensions are grouped together. This leads us to define a 
community extractor combining the use of 

— A given monodimensional community discovery algorithm (that could also 
allow for overlapping communities) 

— Frequent item set pattern mining to allow merging discovered monodimen- 
sional communities into multidimensional ones 

By browsing over the lattice generated by the apriori algorithm, it is possible 
to extract multidimensional communities of different sizes (pattern lengths) 
and so navigate the complex multidimensional structure of a network, in a 
way that previous methods could not permit. 
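
To make the combination of the two ingredients concrete, the following Python sketch mines closed item sets from monodimensional community memberships. It is only an illustration under stated assumptions, not the implementation used in the experiments: the monodimensional communities are assumed to be already computed, the miner is a plain level-wise (apriori-style) search, and all function names and the toy input are hypothetical.

```python
from itertools import combinations


def to_transactions(communities):
    """Turn {dimension: [set of nodes, ...]} into
    {node: frozenset of (dimension, community index) items}."""
    items = {}
    for dim, groups in communities.items():
        for idx, group in enumerate(groups):
            for node in group:
                items.setdefault(node, set()).add((dim, idx))
    return {node: frozenset(s) for node, s in items.items()}


def frequent_itemsets(transactions, min_support):
    """Level-wise search; returns {itemset: frozenset of supporting nodes}."""
    def cover(itemset):
        return frozenset(n for n, t in transactions.items() if itemset <= t)

    level = {frozenset([i]) for t in transactions.values() for i in t}
    found = {}
    while level:
        frequent = {}
        for candidate in level:
            nodes = cover(candidate)
            if len(nodes) >= min_support:
                frequent[candidate] = nodes
        found.update(frequent)
        # Join frequent k-itemsets into (k+1)-candidates and prune every
        # candidate that has an infrequent k-subset.
        level = set()
        for x, y in combinations(frequent, 2):
            u = x | y
            if len(u) == len(x) + 1 and all(u - {i} in frequent for i in u):
                level.add(u)
    return found


def closed_itemsets(found):
    """Keep itemsets with no proper superset supported by the same nodes."""
    return {s: nodes for s, nodes in found.items()
            if not any(s < t and nodes == other for t, other in found.items())}


# Hypothetical monodimensional communities in three dimensions D1, D2, D3.
communities = {
    "D1": [{1, 2, 3}, {4, 5}],
    "D2": [{1, 3, 5}, {2, 4}],
    "D3": [{1, 3, 4, 5}],
}
closed = closed_itemsets(
    frequent_itemsets(to_transactions(communities), min_support=2))
```

Each key of `closed` is a set of (dimension, community) items and each value is the group of nodes sharing all of them, so larger keys correspond to more specialized multidimensional communities in the lattice.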

The proposed method could lead to the development of analytical tools to
characterize the redundancy in the dimensions, the impact of new dimensions
on the network structure, and more in general to evaluate the interplay
between dimensions. For these reasons, it has clear applications in real-world
problems, including characterizing the interplay between mobility and
communication dimensions in a place-to-place network [13], the similarity between
users in a user-mobility profile network [38], or the analysis of the spreading of
infectious diseases [Sj. We leave for future research the analysis of potential
applications of ABACUS.